Pto DC3000 re-annotation proposal

Goal

To capture, comprehensively and systematically, the extensive bioinformatic and experimental characterization of the Pto DC3000 genome in the annotation records at NCBI.

Why this is needed

  1. The P. syringae research community does not maintain an autonomous database of genome annotation records. As a result, the records at NCBI represent the primary destination for annotation updates and referral.
  2. As more genomes related to Pto DC3000 are sequenced, availability of a thoroughly annotated reference genome will substantially enhance the quality of annotation for related genomes.
  3. The genome annotation records at NCBI represent the raw material for development of sequence motif models and annotation of new genomes using BLAST similarity. Comprehensive incorporation of high-quality, manually reviewed data into the annotation records for Pto DC3000 will substantially improve the quality of analyses based upon these records.

The following proposal aims to reach a mutually acceptable agreement on how best to format features and qualifiers to capture the available data.

Topics covered

Proposal for wider application of the locus tag qualifier

  • Rationale
  • Suggested formats

Capture of version changes

  • For the entire annotation record
  • For individual gene/CDS or other records with assigned locus tags

Capture of RNA and protein sequence data

  • By addition of qualifiers to CDS features
  • By addition of independent features (e.g. misc_RNA and misc_peptide)

Examples of proposed formats for individual feature types likely to change with re-annotation:

promoters
protein_bind
ncRNA
mRNA
repeat_region

  • generic
  • containing mobile elements

CDS
misc_RNA
misc_peptide (newly proposed)

_____________________________________________________________________________________________________

Proposal for wider application of the locus tag qualifier

A clear means of numbering features, using a field other than the /note field, is requested for the following feature types:
promoter, protein_bind, mRNA and ncRNA.

The /locus_tag qualifier, though traditionally used only for genes, appears to be the qualifier best suited to meeting this need.

Rationale:

  1. “Locus” is commonly defined as any discrete physical location on the DNA molecule. The promoter, binding site, mRNA and ncRNA features being proposed have been generated using published models, and most are experimentally confirmed. As such, they have superior documentation and a greater demonstrated impact on organismal biology than the vast majority of gene “loci” to which locus tags are currently assigned. Note too that the use of a lowercase letter to distinguish the type of locus tag (see suggested formats below) is consistent with the format already in use for rRNA and tRNA (PSPTO_r01 and PSPTO_t01).

  2. Given their significance to the research community, there is a real need for unique identifiers that can be used to navigate annotation records and distinguish among individual promoters as is currently done for gene loci.

  3. The “catch-all” note field is inadequate as a location for unique identifiers attached to individual features. For example, the locus tag is typically used as the displayed feature label when GenBank files are loaded into standard genome viewers. While the identity of the displayed field can be changed, the length and non-standard format of the note field make it unsuitable as a feature label. Permitting locus tags for the following features would provide a single field for rapidly distinguishing individual features.

Suggested formats for locus tags of different features:

promoter PSPTO_p#### (complete example below)
protein_bind PSPTO_pb#### (complete example below)
mRNA PSPTO_mr#### (complete example below)
ncRNA PSPTO_ncr#### (complete example below)
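
As an illustration of the suggested formats, the following Python sketch builds and classifies locus tags. The PSPTO_ stem and the prefix letters come from the list above and from the existing rRNA and tRNA tags; the function names, the lookup table, and the reading of #### as a zero-padded four-digit number are assumptions made for this example and are not part of the proposal.

```python
import re

# Prefix letters per feature type, as listed in the suggested formats.
# Ordinary gene/CDS tags (e.g. PSPTO_0001) carry no prefix letter.
TAG_PREFIXES = {
    "promoter": "p",
    "protein_bind": "pb",
    "mRNA": "mr",
    "ncRNA": "ncr",
    "rRNA": "r",
    "tRNA": "t",
}

# Stem, optional lowercase prefix, then one or more digits.
_TAG_RE = re.compile(r"^PSPTO_([a-z]*)(\d+)$")


def make_locus_tag(feature_type: str, number: int, width: int = 4) -> str:
    """Build a tag such as PSPTO_p0012 (width=4 is an assumed reading of ####)."""
    prefix = TAG_PREFIXES[feature_type]
    return f"PSPTO_{prefix}{number:0{width}d}"


def classify_locus_tag(tag: str) -> str:
    """Return the feature type implied by a tag's prefix letters."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a PSPTO locus tag: {tag!r}")
    prefix = match.group(1)
    if prefix == "":
        return "gene"
    for feature_type, known_prefix in TAG_PREFIXES.items():
        if known_prefix == prefix:
            return feature_type
    raise ValueError(f"unknown locus tag prefix: {prefix!r}")
```

Because the prefix alone identifies the feature type, a viewer or script could sort features by type from the displayed label without reading the note field.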

Capture of version changes

For the entire annotation record: (DISREGARD as of 09-02-25: the proposal to change the version number for the entire annotation record cannot be accommodated under current EMBL/DDBJ/GenBank policy and is a suggested change of relatively low priority.)

Updates to the P. syringae annotation records are made incrementally, so no version numbers have been incorporated into the genome accession to date. However, the major change currently being planned for the Pto DC3000 annotation record, encompassing data from RNA and peptide sequencing and systematic incorporation of new features and fields, is of sufficient magnitude to warrant a change in the version number of the accession from AE016853 to AE016853.2.

For individual gene/CDS or other records with assigned locus tags

In those cases where the strand of the predicted feature is changed or, in the case of a CDS, the frame of the feature is changed, the existing locus tag will be obsoleted and a new unique locus tag generated for the new feature. Reference to the obsoleted locus tag can be incorporated into the note field.

If a change involves the start site alone, the locus tag will not be changed.
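
The rule for individual features can be written as a small decision function. The following Python sketch is illustrative only: the FeatureCall record, the function names, and the way the reading frame is derived from the 5' coordinate of the CDS are assumptions made for the example, not an established NCBI procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureCall:
    start: int            # lowest genomic coordinate (1-based)
    end: int              # highest genomic coordinate (1-based)
    strand: int           # +1 or -1
    is_cds: bool = False


def reading_frame(call: FeatureCall) -> int:
    """Frame (0, 1 or 2) taken from the 5' end of the coding sequence."""
    if call.strand == 1:
        return (call.start - 1) % 3
    return call.end % 3


def locus_tag_action(old: FeatureCall, new: FeatureCall) -> str:
    """Return "new" if the old tag is obsoleted, otherwise "retain"."""
    if old.strand != new.strand:
        return "new"      # strand changed
    if old.is_cds and new.is_cds and reading_frame(old) != reading_frame(new):
        return "new"      # CDS frame changed
    return "retain"       # e.g. an in-frame change of start site only
```

For example, moving the start of a plus-strand CDS from 100 to 70 while keeping the stop leaves the frame unchanged, so the tag is retained; shifting the same CDS by one base changes the frame and calls for a new tag, with the obsoleted tag recorded in the note field.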

Capture of RNA and protein sequence data

By addition of qualifiers to CDS features

Incorporation of evidence for transcription and protein production represents a significant element of the proposed enhancement of the Pto DC3000 annotation record and in many cases will provide the first clear validation of otherwise uncharacterized proteins (such as those annotated as "hypothetical" or "conserved hypothetical"). Evidence for a transcribed or translated product corresponding to a CDS feature will be captured using the /experiment qualifier (see the CDS example below).

By addition of independent features (e.g. misc_RNA and misc_peptide)

In addition to incorporating protein and RNA sequence data by adding an /experiment qualifier to existing features, the sequenced peptides and transcripts may also be captured as independent features if desired by NCBI. At present there is no feature suitable for capturing the sequenced peptides, which in most cases are unlikely to represent mature peptides (mat_peptide). Creation of a new feature such as misc_peptide is proposed.

With regard to the majority of RNA sequences, the features prim_transcript, mRNA or ncRNA are not appropriate given that the 3' end of the transcript has not been mapped. Annotation as misc_RNA is proposed.

Examples of proposed formats for individual feature types likely to change with re-annotation

promoter

coordinates
/locus_tag="PSPTO_p####"
/bound_moiety="[sigma factor name]"
/inference="ab initio prediction:[algorithm name]"
and/or
/experiment="microarray and/or RT-PCR"
/note="PMID:######"
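
To show how a template like the one above might appear in a flat-file feature table, here is a minimal Python sketch that lays out a feature key, location and qualifiers in GenBank-style columns (key at column 6, location and qualifiers at column 22). The function name, the coordinates and the tag number in the usage note are hypothetical, and the sketch assumes short qualifier values, since it does not wrap long lines.

```python
def format_feature(key, location, qualifiers):
    """Render one feature as GenBank-style feature-table lines.

    qualifiers is a list of (name, value) pairs; integer values are
    written bare (e.g. /transl_table=11), all others are quoted.
    """
    lines = [" " * 5 + key.ljust(16) + location]
    for name, value in qualifiers:
        if isinstance(value, int):
            text = f"/{name}={value}"
        else:
            text = f'/{name}="{value}"'
        lines.append(" " * 21 + text)
    return "\n".join(lines)
```

Calling format_feature("promoter", "1000..1050", [("locus_tag", "PSPTO_p0001"), ("bound_moiety", "[sigma factor name]"), ("note", "PMID:######")]) yields one location line followed by one line per qualifier, in the order given.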

protein_bind

coordinates
/locus_tag="PSPTO_pb####"
/bound_moiety="[protein name]"
/inference="ab initio prediction:[algorithm name]"
and/or
/experiment="microarray and/or RT-PCR"
/note="PMID:######"

ncRNA

coordinates
/locus_tag="PSPTO_ncr####"
/experiment="RNA sequencing"

mRNA

coordinates
/locus_tag="PSPTO_mr####"
/experiment="RNA sequencing"

repeat_region

generic repeats:

coordinates
/inference="ab initio prediction:[algorithm name]"

/note="PMID:######"

repeats containing mobile elements:

coordinates
/mobile_element="insertion sequence:IS##"

/note="PMID:######"

CDS

coordinates
/codon_start=1
/transl_table=11
/locus_tag="PSPTO_####"

Qualifier changes and additions as appropriate to the CDS in question:

/product="protein of unknown function" or "conserved protein of unknown function"
(if previously "hypothetical" or "conserved hypothetical")

/inference="ab initio prediction:[algorithm]"
(if a newly called gene)

/experiment="RNA sequencing"

/experiment="peptide sequencing"

/note="PMID:######"

(to capture the publication detailing the experimental evidence)
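
The CDS qualifier changes listed above could be applied programmatically along the following lines. This is a sketch under stated assumptions: qualifiers are held as a dictionary mapping each qualifier name to a list of values, the existing product strings are assumed to read exactly "hypothetical protein" and "conserved hypothetical protein", and the function and variable names are invented for the example.

```python
from copy import deepcopy

# Assumed old product strings mapped to the wording proposed above.
PRODUCT_RENAMES = {
    "hypothetical protein": "protein of unknown function",
    "conserved hypothetical protein": "conserved protein of unknown function",
}


def update_cds_qualifiers(qualifiers, rna_seq=False, peptide_seq=False, pmid=None):
    """Return a copy of a CDS qualifier dict with the proposed changes applied."""
    updated = deepcopy(qualifiers)

    # Replace "hypothetical" product names; leave all others untouched.
    if "product" in updated:
        updated["product"] = [PRODUCT_RENAMES.get(p, p) for p in updated["product"]]

    # Record each kind of experimental support once.
    evidence = []
    if rna_seq:
        evidence.append("RNA sequencing")
    if peptide_seq:
        evidence.append("peptide sequencing")
    for item in evidence:
        experiments = updated.setdefault("experiment", [])
        if item not in experiments:
            experiments.append(item)

    # Cite the supporting publication in the note field.
    if pmid is not None:
        updated.setdefault("note", []).append(f"PMID:{pmid}")
    return updated
```

Working on a copy keeps the original record available for comparison, which is convenient when reviewing changes before submission.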

misc_RNA
(to capture regions for which RNA sequences have been generated but the complete mRNA or ncRNA not yet established)

coordinates
/experiment="RNA sequencing"
/note="PMID:######"

misc_peptide
(newly proposed as a means of capturing peptide sequence data)

coordinates
/experiment="peptide sequencing"
/note="PMID:######"


Magdalen Lindeberg
PPI Project Coordinator
Dept Plant Pathology
Cornell University